Off-Policy Temporal-Difference Learning with Function Approximation
Abstract
We introduce the first algorithm for off-policy temporal-difference learning that is stable with linear function approximation. Off-policy learning is of interest because it forms the basis for popular reinforcement learning methods such as Q-learning, which has been known to diverge with linear function approximation, and because it is critical to the practical utility of multi-scale, multi-goal learning frameworks such as options, HAMs, and MAXQ. Our new algorithm combines TD(λ) over state-action pairs with importance sampling ideas from our previous work. We prove that, given training under any ε-soft policy, the algorithm converges w.p.1 to a close approximation (as in Tsitsiklis and Van Roy, 1997; Tadic, 2001) to the action-value function for an arbitrary target policy. Variations of the algorithm designed to reduce variance introduce additional bias but are also guaranteed convergent. We also illustrate our method empirically on a small policy-evaluation problem, showing reduced variance compared to the most obvious importance sampling algorithm for this problem. Our current results are limited to episodic tasks with episodes of bounded length.

Although Q-learning remains the most popular of all reinforcement learning algorithms, it has been known since about 1996 that it is unsound with linear function approximation (see Gordon, 1995; Bertsekas and Tsitsiklis, 1996). The most telling counterexample, due to Baird (1995), is a seven-state Markov decision process with linearly independent feature vectors for which an exact solution exists, yet for which the approximate values found by Q-learning diverge to infinity. This problem prompted the development of residual gradient methods (Baird, 1995), which are stable but much slower than Q-learning, and fitted value iteration (Gordon, 1995, 1999), which is also stable but limited to restricted, weaker-than-linear function approximators. Of course, Q-learning has been used with linear function approximation since its invention (Watkins, 1989), often with good results, but the soundness of this approach is no longer an open question. There exist non-pathological Markov decision processes for which it diverges; it is absolutely unsound in this sense.

A sensible response is to turn to some of the other reinforcement learning methods, such as Sarsa, that are also efficient and for which soundness remains a possibility. An important distinction here is between methods that must follow the policy they are learning about, called on-policy methods, and those that can learn from behavior generated by a different policy, called off-policy methods. Q-learning is an off-policy method in that it learns the optimal policy even when actions are selected according to a more exploratory or even random policy. Q-learning requires only that all actions be tried in all states, whereas on-policy methods like Sarsa require that they be selected with specific probabilities.

Although the off-policy capability of Q-learning is appealing, it is also the source of at least part of its instability problems. For example, in one version of Baird's counterexample, the TD(λ) algorithm, which underlies both Q-learning and Sarsa, is applied with linear function approximation to learn the action-value function Q^π for a given policy π. Operating in an on-policy mode, updating state-action pairs according to the same distribution with which they would be experienced under π, this method is stable and convergent near the best possible solution (Tsitsiklis and Van Roy, 1997; Tadic, 2001). However, if state-action pairs are updated according to a different distribution, for instance the one generated by following the greedy policy, then the estimated values again diverge to infinity. This and related counterexamples suggest that at least some of the reason for the instability of Q-learning is the fact that it is an off-policy method; they also make it clear that this part of the problem can be studied in a purely policy-evaluation context.
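To make the instability concrete, the sketch below is not Baird's seven-state construction but a standard two-state illustration of the same effect (often called the "θ → 2θ" example): a single weight θ, feature values 1 and 2, zero reward, and a discount factor above 0.5. Updating only the first transition, a distribution of updates that no on-policy method would produce, drives θ to infinity even though the exact solution θ = 0 exists. The step size, discount factor, and variable names are illustrative only.

```python
# Two states with linear values V(s1) = 1*theta and V(s2) = 2*theta,
# zero reward on the transition s1 -> s2, and discount gamma > 0.5.
# Only the s1 -> s2 transition is ever updated (an off-policy update
# distribution), so the single weight grows without bound.

gamma = 0.99   # discount-rate parameter (illustrative)
alpha = 0.1    # step size (illustrative)
theta = 1.0    # the single weight of the linear approximator

for _ in range(100):
    # semi-gradient TD(0) update on s1 -> s2 with reward 0:
    #   delta = r + gamma * V(s2) - V(s1) = gamma * 2 * theta - theta
    delta = gamma * 2.0 * theta - theta
    theta += alpha * delta * 1.0   # feature of s1 is 1

print(theta)   # roughly 1.098**100, i.e. divergence toward infinity
```

Each pass multiplies θ by 1 + α(2γ − 1) > 1, so the weight grows geometrically; under on-policy updating, states are updated in proportion to how often the target policy visits them, which is what the convergence results of Tsitsiklis and Van Roy (1997) rely on.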
Despite these problems, there remains substantial reason for interest in off-policy learning methods. Several researchers have argued for an ambitious extension of reinforcement learning ideas into modular, multi-scale, and hierarchical architectures (Sutton, Precup & Singh, 1999; Parr, 1998; Parr & Russell, 1998; Dietterich, 2000). These architectures rely on off-policy learning to learn about multiple subgoals and multiple ways of behaving from a single stream of experience. For these approaches to be feasible, some efficient way of combining off-policy learning and function approximation must be found.

Because the problems with current off-policy methods become apparent in a policy-evaluation setting, it is there that we focus in this paper. In previous work we considered multi-step off-policy policy evaluation in the tabular case. In this paper we introduce the first off-policy policy-evaluation method consistent with linear function approximation. Our mathematical development focuses on the episodic case, and in fact on a single episode. Given a starting state and action, we show that the expected off-policy update under our algorithm is the same as the expected on-policy update under conventional TD(λ). This, together with some variance conditions, allows us to prove convergence and bounds on the error in the asymptotic approximation identical to those obtained by Tsitsiklis and Van Roy (1997; Bertsekas and Tsitsiklis, 1996).

1. Notation and Main Result

We consider the standard episodic reinforcement learning framework (see, e.g., Sutton & Barto, 1998) in which a learning agent interacts with a Markov decision process (MDP). Our notation focuses on a single episode of $T$ time steps, $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, r_T, s_T$, with states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, and rewards $r_t \in \Re$. We take the initial state and action, $s_0$ and $a_0$, to be given arbitrarily. Given a state and action, $s_t$ and $a_t$, the next reward, $r_{t+1}$, is a random variable with mean $\bar{r}^{a_t}_{s_t}$, and the next state, $s_{t+1}$, is chosen with probabilities $p^{a_t}_{s_t s_{t+1}}$. The final state is a special terminal state that may not occur on any preceding time step. Given a state $s_t$, $0 < t < T$, the action $a_t$ is selected with probability $\pi(s_t, a_t)$ or $b(s_t, a_t)$, depending on whether policy $\pi$ or policy $b$ is in force. We always use $\pi$ to denote the target policy, the policy that we are learning about. In the on-policy case, $\pi$ is also used to generate the actions of the episode. In the off-policy case, the actions are instead generated by $b$, which we call the behavior policy. In either case, we seek an approximation to the action-value function $Q^\pi : \mathcal{S} \times \mathcal{A} \to \Re$ for the target policy $\pi$:

$$Q^\pi(s, a) = E_\pi\{ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T \mid s_t = s, a_t = a \},$$

where $0 \le \gamma \le 1$ is a discount-rate parameter. We consider approximations that are linear in a set of feature vectors $\{\phi_{sa}\}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$.
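As a rough, hypothetical sketch of this setting (not the algorithm developed in this paper), the code below assumes the usual linear form $Q_\theta(s,a) = \theta^\top \phi_{sa}$ and applies a TD(0)-style update over one episode of state-action pairs, reweighting each step by a running product of importance-sampling ratios π/b. The episode layout, function names, and the particular weighting scheme are illustrative assumptions; the paper's actual TD(λ) algorithm, its eligibility traces, and its exact weighting are not reproduced here.

```python
import numpy as np

def linear_q(theta, phi_sa):
    """Linear action-value estimate: Q_theta(s, a) = theta . phi_sa."""
    return float(np.dot(theta, phi_sa))

def off_policy_episode_update(theta, episode, phi, pi, b, gamma=0.99, alpha=0.01):
    """One-episode sketch of importance-sampling-weighted linear TD(0)
    over state-action pairs (illustrative, not the paper's algorithm).

    theta:   weight vector (numpy array).
    episode: list of transitions (s, a, r, s_next, a_next) generated by the
             behavior policy b; the final transition has s_next = a_next = None
             to mark the terminal state.
    phi:     dict mapping (s, a) -> feature vector phi_sa (numpy array).
    pi, b:   dicts mapping (s, a) -> action probability under the target /
             behavior policy.
    """
    theta = np.array(theta, dtype=float, copy=True)
    rho = 1.0  # running importance-sampling ratio; s0, a0 are given, so it starts at 1
    for (s, a, r, s_next, a_next) in episode:
        q_sa = linear_q(theta, phi[(s, a)])
        q_next = 0.0 if s_next is None else linear_q(theta, phi[(s_next, a_next)])
        delta = r + gamma * q_next - q_sa            # TD error for this step
        theta += alpha * rho * delta * phi[(s, a)]   # semi-gradient update, reweighted by rho
        if s_next is not None:
            # correct for the fact that a_next was chosen by b rather than by pi
            rho *= pi[(s_next, a_next)] / b[(s_next, a_next)]
    return theta
```

When the behavior policy equals the target policy (b = π), every ratio is 1 and the sketch reduces to ordinary on-policy TD(0) over state-action pairs.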
Similar articles
An Analysis of Temporal-Difference Learning with Function Approximation
We discuss the temporal-difference learning algorithm as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence with probabili...
An Analysis of Temporal-Difference Learning with Function Approximation
We discuss the temporal-difference learning algorithm, as applied to approximating the cost-to-go function of an infinite-horizon discounted Markov chain. The algorithm we analyze updates parameters of a linear function approximator on-line, during a single endless trajectory of an irreducible aperiodic Markov chain with a finite or infinite state space. We present a proof of convergence (with proba...
Average Cost Temporal-Difference Learning (LIDS-P-2390, May 1997)
We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of th...
Evolutionary Algorithms for Reinforcement Learning
There are two distinct approaches to solving reinforcement learning problems, namely, searching in value function space and searching in policy space. Temporal-difference methods and evolutionary algorithms are well-known examples of these approaches. Kaelbling, Littman and Moore recently provided an informative survey of temporal-difference methods. This article focuses on the application of evo...
Stable Function Approximation in Dynamic Programming
The success of reinforcement learning in practical problems depends on the ability to combine function approximation with temporal-difference methods such as value iteration. Experiments in this area have produced mixed results; there have been both notable successes and notable disappointments. Theory has been scarce, mostly due to the difficulty of reasoning about function approximators that gen...